Comments on “Lake Woebegone,” Twenty Years Later by J. J. Cannell, MD.


D. J. McRae, Ph.D. 


J. J. Cannell’s article on the so-called “Lake Woebegone” effect for K-12 educational 
testing systems is mostly an historical account of technical issues and policy 
considerations that led in part to development of new types of test instruments for K-12 
testing, i.e., standards-based tests. However, the article also comments on current testing 
practices, and charges that some of the technical issues and policy considerations that led 
to the Lake Woebegone effect are still in place. In particular, the Cannell article focuses 
on the California Standardized Testing and Reporting (STAR) program in recent years 
(pp 8-11 of the article). 

In these comments, I’ll focus on Cannell’s comments on STAR. There perhaps is much
of interest in the earlier historical treatment of the Lake Woebegone set of issues, but I’ll 
resist muddying the water with those issues and focus more narrowly on the California 
STAR issues raised by Cannell. 

In the article, Cannell argues that there are three testing system practices that affect the 
credibility of current STAR standards-based test results: (1) Many of the test questions 
are the same from year-to-year, allowing teachers to teach directly to test items rather 
than focus on teaching the content standards that underlie the test; (2) Test preparation 
materials are laced with test questions that unnecessarily narrow the curriculum; and (3) 
Teachers and school administrators administer the tests, thus allowing for outright 
cheating and/or undue influence on results. 

Let me address these three practices in reverse order. 

For practice # 3, Cannell suggests California use outside or independent proctors to 
administer the STAR tests, like the standardized tests that Cannell cites early in the 
article (college entrance, medical school admissions, and medical licensure tests). If 
money were not a consideration, I would find no fault with Cannell’s argument.
Certainly, using proctors who have a self-interest in the outcomes of the tests is a less 
than ideal practice, allowing for the possibility that the proctors outright cheat by helping 
students with the test (contrary to the test administration directions) or by changing 
answers after the fact or whatever. When instances of outright cheating do occur, they 
are widely publicized. However, instances of “adult testing irregularities”¹ have quite
low incidence in K-12 testing programs. For the California STAR program, with more
than 200,000 test proctors, only 50 to 100 such instances are reported each year for
investigation; only a handful are confirmed as incidents that affect the validity of the test
results, and in these cases the scores are nullified. The sanctions for teachers or school
administrators involved in “adult testing irregularities” are quite severe. Most folks agree
that incidents of outright cheating by proctors of K-12 tests do not have such weight as to
call for a major reform of this testing practice.

¹ This is the terminology used to describe failure to follow the prescribed test administration protocol by
adults administering the tests; the terminology includes both intentional cheating and unintentional or
inadvertent mistakes made during the test administration process.

Nevertheless, it should be acknowledged that use of outside or independent proctors 
would be a testing practice that would improve the credibility of K-12 testing program 
results. But, such a practice would be extremely costly. The STAR program currently 
tests just under 5 million students per year in grades 2-11 in California at an
operational cost of roughly $60 million per year, or about $12/student. By way of contrast, the
medical school admissions test that Cannell cites currently costs $210/student. Some of 
that additional cost may be traced to additional test forms being developed, and 
psychometric overlap required to ensure that the scores are comparable from test form to
test form and from one test administration to another. But most of the increased cost is 
simply due to the administrative logistics needed to hire outside or independent proctors, 
to arrange for secure test sites, etc. The per student cost for the medical school 
admissions test is more than 17 times higher than the per student cost for STAR; if one 
requested a STAR budget from California lawmakers to accommodate the testing 
practices found in the exam system cited by Cannell, the request would be a full billion 
(with a b) dollars, rather than the current $60 million. Such a request would be
dead on arrival in the California legislature. 
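
The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted above. The student count is rounded up to an even 5 million (the text says "just under"), and the variable names are illustrative assumptions rather than anything drawn from STAR documentation.

```python
# Figures quoted in the text. STAR_STUDENTS is rounded to 5 million,
# which is an assumption of this sketch (the text says "just under").
STAR_ANNUAL_COST = 60_000_000            # dollars per year
STAR_STUDENTS = 5_000_000                # students tested per year (rounded)
ADMISSIONS_TEST_COST_PER_STUDENT = 210   # dollars, medical school admissions test

# Current per-student operational cost of STAR.
star_cost_per_student = STAR_ANNUAL_COST / STAR_STUDENTS

# How many times more expensive the admissions test is per student.
cost_ratio = ADMISSIONS_TEST_COST_PER_STUDENT / star_cost_per_student

# STAR budget if every student were tested at the admissions-test rate.
projected_budget = ADMISSIONS_TEST_COST_PER_STUDENT * STAR_STUDENTS

print(f"STAR cost per student: ${star_cost_per_student:.2f}")
print(f"Per-student cost ratio: {cost_ratio:.1f}x")
print(f"Projected STAR budget: ${projected_budget / 1e9:.2f} billion")
```

Under these rounded inputs the ratio works out to 17.5 and the projected budget to roughly $1.05 billion, consistent with the "more than 17 times" and "full billion" figures in the text.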

It should be acknowledged there are less costly schemes to counteract potential cheating 
by proctors than the medical school admissions test scheme cited by Cannell. Indeed, 
one such practice is discussed as practice # 1 below. However, wholesale removal of 
teachers serving as test proctors for their own students would not be a measured public 
policy solution given the scope of this potential problem. The bottom line is that K-12 
testing systems, for economic reasons, are designed with the assumption that teachers and 
school administrators will handle test proctoring duties honestly and ethically. And in the
vast majority of cases, they do.

Practice # 2 cited by Cannell, the use or overuse of test preparation materials too closely 
matched to actual test questions, is a more serious issue (in my mind) than the issue of 
outside proctors. Actually, this is not a testing system issue; it is an instructional practice 
issue at its core. I am fond of saying that the best test preparation practice for a 
standards-based test is good solid standards-based instruction. I should follow that 
statement by saying that good solid standards-based instruction does not involve heavy 
use of test preparation materials. The appropriate use of test preparation materials should 
be simply to familiarize students with the formats likely to be encountered on a test. 

However, it should be acknowledged that one of the unintended instructional practices 
that surfaces far too frequently in schools involved in high stakes testing programs is the 
overuse of test preparation materials too closely matched to actual test questions. In 
California, there is statutory language prohibiting use of instructional materials designed 
to raise test scores without raising the underlying achievement measured by the test. 
However, it is one thing to put such words in legislative language, and it is another thing
to enforce such language. There are no testing police out there to check on instructional 
practice, and hence there are no real sanctions for misuse of test preparation materials. 

The fundamental assumption for the design of high stakes K-12 testing systems is that 
teachers and school administrators will make honest and ethical and appropriate use of 
test preparation materials, and not abuse their use. Unfortunately, too often the line 
between appropriate use and inappropriate abuse is obscured in practice. For 
highlighting this practice, I give Cannell at least half credit, though I do not agree with 
the rhetoric he uses to describe the overall effect of this practice nor do I agree that abuse 
of test preparation materials totally invalidates the information that comes from a K-12 
system such as STAR. Rather, I would agree with the language of the Attorney General 
from Oklahoma whom Cannell cites later in his article, that K-12 testing systems “lend
themselves to being compromised” by such practices. For the longer term, widespread 
acceptance of the proposition that good solid standards-based instruction yields far better 
results on high stakes tests than weak test prep oriented instruction will discourage 
Practice # 2. 

Practice # 1, the repeated use of test questions from year-to-year, is a testing system 
practice that has both pros and cons. First, it should be noted that kids do not see the 
same test questions year after year. Kids move on (for the most part) to new grade levels 
each year, and thus are exposed to an entirely new set of questions on the California 
STAR statewide tests each year. Kids who are retained do see repeated test questions, 
but research has shown that test questions administered more than 6 months apart do not 
have a sufficient memory factor to invalidate test results. But, Cannell is not concerned 
about kids seeing repeated test questions. He is concerned that teachers have access to 
test questions and those teachers then use this knowledge to teach individual test items to 
the new crop of kids they have the next year. 

In one sense, this is again an instructional issue rather than a testing issue. In this sense, 
it is at its core an issue of honesty and ethics. But the testing system also has some 
responsibility here. In designing a testing system, the strongest way to provide for gain 
score data (i.e., year-to-year comparisons) is to use repeated test items. Repeated use of 
good test items is also economical. But repeated use of test questions does open the door 
for potential item exposure that can compromise the test results. The classic way a 
testing system can combat repeated use of individual test questions is to design multiple 
alternate forms for tests, forms that measure the same thing but use differing test 
questions. This is a routine practice with high stakes higher education admissions tests; it 
is a practice sometimes used with K-12 tests, but not often enough. For the STAR 
standards-based tests, it has not been used to date. Cannell is accurate when he indicates 
that roughly 50 percent of the questions from a given form of a STAR standards-based 
test are repeated the following year; it would be a better practice to have multiple 
alternate forms for STAR standards-based tests, and to randomly assign such forms to 
schools. Such a practice would decrease the potential that any given teacher would be
administering a significant number of repeated questions each year, and thus discourage
teachers from using their access to last year’s tests to “teach to test items” and thus
compromise the credibility of STAR test results.

Use of multiple test forms is also a cost issue. The cost is primarily for additional test 
development, not additional test administration (though there would be some marginal 
increase in test administration costs). The additional cost for routine use of alternate 
forms may increase the cost of STAR by perhaps 10 or 20 percent.² For Practice # 1, I
would agree with Cannell that the testing system can and should do more to discourage 
the unintended practices that may come from repeated use of test questions from year-to- 
year. 
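
To put the 10 to 20 percent estimate in dollar terms, the sketch below applies it to the program figures quoted earlier in these comments ($60 million per year, roughly 5 million students). The rounded student count and the helper name `cost_with_alternate_forms` are assumptions made for illustration only.

```python
# Figures quoted earlier in the text; the student count is rounded
# to 5 million, which is an assumption of this sketch.
STAR_ANNUAL_COST = 60_000_000   # dollars per year
STAR_STUDENTS = 5_000_000       # students tested per year (rounded)


def cost_with_alternate_forms(increase_percent):
    """Return (total annual cost, cost per student) after a percentage increase.

    Integer arithmetic keeps the dollar totals exact.
    """
    total = STAR_ANNUAL_COST * (100 + increase_percent) // 100
    return total, total / STAR_STUDENTS


low_total, low_per_student = cost_with_alternate_forms(10)
high_total, high_per_student = cost_with_alternate_forms(20)

print(f"10% increase: ${low_total:,} total, ${low_per_student:.2f} per student")
print(f"20% increase: ${high_total:,} total, ${high_per_student:.2f} per student")
```

On these assumptions the program would cost between $66 million and $72 million per year, or about $13 to $14 per student, which is a small fraction of the cost of moving to outside proctors.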

I would have one additional note on a detail in Cannell’s paper. On page 9, Cannell cites
STAR scores from the Stanford Achievement Test from 1998 to 2002 and compares them 
to scores from the California Achievement Test from 2003, and concludes that the reading 
and language scores “plummeted” in 2003. The scores compared are apples and oranges 
since they are based on publisher norms samples from differing years and for different 
tests. The STAR test vendor supplied a conversion table in 2003 to convert the SAT 
scores from 1998 to 2002 to the CAT score scale. When the converted scores are 
analyzed, one finds that the California reading and language scores from the nationally
normed portion of STAR increased modestly from 2002 to 2003, following the trend
previously identified from 1998 to 2002. 

Allow me to complete these comments with several additional observations. 

First, while I disagree or perhaps partially agree with Cannell on the specific testing 
system practices he identifies, I should comment that there are additional testing practices 
specific to the STAR standards-based tests that could be improved. In particular: 

• I would note that the year-to-year comparisons of scores could be put on a more 
robust basis by paying greater attention to the assumptions required for the annual 
“equating” study that is conducted to put subsequent year test forms on the same 
scale of measurement. This is a technical issue, but it greatly affects the 
interpretation of test score gains from the STAR system. 

• The STAR standards-based tests do not permit comparisons from grade to grade 
[they do not have the technical property called “vertical scaling”]. The STAR 
system would be on stronger technical grounds, particularly for use for 
accountability system calculations involving year to year gain scores, if it were
designed to incorporate vertical scaling. 

• The STAR system could have better reporting for parents and the public,
particularly via the availability of “exemplars” to illustrate what performance
levels such as Below Basic, or Basic, or Proficient, or Advanced mean.

• The STAR system could be coordinated with the high school exit testing system
(CAHSEE) and the English Language Development testing system (CELDT) to
eliminate redundant testing.

• The STAR standards-based system could avoid using near-random scores to
distinguish between the Far Below Basic and Below Basic performance levels, a
practice that permits accountability index scores to increase without documented
increases in achievement levels.

² Test development costs tend to be “fixed” costs that are independent of how many students are tested;
with the large number of students tested under the California STAR system, test development costs may be
amortized over the large number of students and thus do not increase overall testing program costs the same
way that operational costs (such as use of outside proctors) increase overall costs.

This is not the time or space to go into these improvements in greater detail, but it is 
appropriate to mention there are additional testing practices that need attention to 
improve the California STAR system. 

Second, I cannot help but make a few comments on the broader issues mentioned in the 
historical portion of Cannell’s paper: 

• On page 2, Cannell seems to make the assumption that public school tests should 
have the same properties as other standardized tests he is familiar with, to wit 
college entrance exams, medical school admissions tests, medical licensure tests. 
As an educational measurement specialist with more than 35 years’ experience, I
have to comment there are many brands of standardized tests with each built to 
have specified properties unique to their intended applications. The most widely 
used K-12 tests during the 60’s and 70’s and 80’s and at least half of the 90’s 
were nationally normed commercial tests (NRTs); they were built to have certain 
properties that were appropriate for their intended use. When high stakes
accountability use became prominent in the mid- to late-90’s, the NRTs began to be
gradually replaced by standards-based tests custom designed to measure each
state’s unique academic content standards. The standards-based tests have their 
own set of properties that differ in many ways from the NRT properties. And 
both of these types of tests have properties that differ from the college entrance or 
medical school admissions tests that Cannell cites. Assuming that all 
standardized tests are built with the same set of properties in mind is not an 
accurate assumption. 

• On page 4, Cannell seems to equate blatant cheating by adults [such as erasing 
wrong answers and marking correct answers after students have completed their 
tests] with practices such as abuse of test preparation materials. The two sets of 
activities are not in the same domain. Blatant cheating is dishonest, unethical, and 
illegal; poor instructional practice needs to be corrected, but is not in the same 
category as blatant cheating. 

• On page 8, Cannell accuses test publishers of collaborating with others who 
provide test preparation materials. I’d like to say Cannell’s accusation is
balderdash, but my own view is that sometimes test publishers have become 
involved in the publishing of test preparation materials to the detriment of their 
efforts to publish good testing materials. There is a current trend for test 
publishers to become involved in what are called “formative” testing systems; 
these are instructional tests (not accountability or high stakes tests) that can be
misused by turning out inappropriate test preparation materials. As a former test
publisher, I have substantial concerns about this recent trend. 

In his article, Cannell laments that his challenge to the K-12 testing establishment in the 
1980’s went for naught. I would disagree. In fact, in my opening sentence for this set of
comments, I gave partial credit to Cannell for the development of new types of testing 
instruments for the K-12 testing system in the U.S. in the 1990’s. Cannell uses 
flamboyant rhetoric, with words like “cheat” and “corrupted” that tend to obfuscate 
underlying legitimate testing system issues that he raises. Perhaps such rhetoric is needed 
to get attention for the issues he raises, particularly mainstream media attention. But, if 
one discards the rhetoric, Cannell raises issues that need to be addressed in the design of 
K-12 testing systems in the United States, and in this context he provides a valuable 
voice. I would not agree with Cannell’s characterization that America has a “corrupt
testing infrastructure.” Rather, I would offer that America has an imperfect testing 
infrastructure, one that earnestly attempts to serve multiple masters with quality testing 
systems, one that has its warts and blemishes at any given moment in time, but one that 
also improves over time, as is the American way. 


Doug McRae is an Educational Measurement Specialist who resides in Monterey, 
California. He earned a Ph.D. in Psychometrics from the University of North Carolina 
at Chapel Hill, and has served in various capacities in the K-12 testing industry for more 
than 35 years. He served as Vice-President for Publishing at CTB McGraw-Hill in the 
early 1990’s, overseeing development of standardized tests administered to 15 to 20
million K-12 students annually. He also served as Senior Advisor for the initial
development of the Standardized Testing and Reporting Program (STAR)
standards-based tests in California in the late 1990’s.

